
    Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim

    NVDLA is an open-source deep neural network (DNN) accelerator that has received a lot of attention from the community since its introduction by Nvidia. It is a full-featured hardware IP and can serve as a good reference for research and development of SoCs with integrated accelerators. However, an expensive FPGA board is required to experiment with this IP in a real SoC. Moreover, since NVDLA is clocked at a lower frequency on an FPGA, it is hard to do accurate performance analysis with such a setup. To overcome these limitations, we integrate NVDLA into a real RISC-V SoC on the Amazon cloud FPGA using FireSim, a cycle-exact FPGA-accelerated simulator. We then evaluate the performance of NVDLA by running the YOLOv3 object-detection algorithm. Our results show that NVDLA can sustain 7.5 fps when running YOLOv3. We further analyze the performance by showing that sharing the last-level cache with NVDLA can result in up to 1.56x speedup. We then identify that sharing the memory system with the accelerator can result in unpredictable execution times for the real-time tasks running on this platform. We believe this is an important issue that must be addressed in order for on-chip DNN accelerators to be incorporated in real-time embedded systems. Comment: Presented at the 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2'19)

    Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs

    Using FPGAs to accelerate ConvNets has attracted significant attention in recent years. However, FPGA accelerator design has not leveraged the latest progress of ConvNets. As a result, key application characteristics such as frames per second (FPS) are ignored in favor of simply counting GOPs, and results on accuracy, which is critical to application success, are often not even reported. In this work, we adopt an algorithm-hardware co-design approach to develop a ConvNet accelerator called Synetgy and a novel ConvNet model called DiracDeltaNet. Both the accelerator and the ConvNet are tailored to FPGA requirements. DiracDeltaNet, as the name suggests, is a ConvNet with only 1×1 convolutions, while spatial convolutions are replaced by more efficient shift operations. DiracDeltaNet achieves competitive accuracy on ImageNet (88.7% top-5), but with 42× fewer parameters and 48× fewer OPs than VGG16. We further quantize DiracDeltaNet's weights and activations to 4 bits, with less than 1% accuracy loss. These quantizations exploit the nature of FPGA hardware well. In short, DiracDeltaNet's small model size, low computational OP count, low precision, and simplified operators allow us to co-design a highly customized computing unit for an FPGA. We implement the computing units for DiracDeltaNet on an Ultra96 SoC system through high-level synthesis. Our accelerator's final top-5 accuracy of 88.1% on ImageNet is higher than that of all previously reported embedded FPGA accelerators. In addition, the accelerator reaches an inference speed of 66.3 FPS on the ImageNet classification task, surpassing prior works with similar accuracy by at least 11.6×. Comment: Updated to the latest results
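    The two core ideas in the abstract above, replacing spatial convolutions with per-channel shifts followed by 1×1 convolutions and quantizing tensors to 4 bits, can be illustrated with a minimal NumPy sketch. This is an illustrative reconstruction, not the paper's implementation: the function names, the shift-offset convention, and the uniform symmetric quantizer are assumptions.

```python
import numpy as np

def shift(x, offsets):
    """Shift each channel of x (C, H, W) by its (dy, dx) offset with
    zero padding at the borders -- the data-movement-only operator that
    replaces spatial convolutions in shift-based ConvNets."""
    out = np.zeros_like(x)
    h, w = x.shape[1:]
    for c, (dy, dx) in enumerate(offsets):
        ys = slice(max(dy, 0), h + min(dy, 0))    # source rows
        xs = slice(max(dx, 0), w + min(dx, 0))    # source cols
        yd = slice(max(-dy, 0), h + min(-dy, 0))  # destination rows
        xd = slice(max(-dx, 0), w + min(-dx, 0))  # destination cols
        out[c, yd, xd] = x[c, ys, xs]
    return out

def conv1x1(x, w):
    """Pointwise (1x1) convolution: a plain matrix multiply over the
    channel dimension, which maps directly onto an FPGA MAC array."""
    c_in, h, wd = x.shape
    return (w @ x.reshape(c_in, -1)).reshape(w.shape[0], h, wd)

def quantize(x, bits=4):
    """Uniform symmetric fake-quantization to the given bit width
    (one assumed scheme; the paper's exact quantizer may differ)."""
    qmax = 2 ** (bits - 1) - 1
    m = np.abs(x).max()
    scale = m / qmax if m > 0 else 1.0
    return np.round(x / scale).clip(-qmax, qmax) * scale
```

    A shift layer followed by `conv1x1` mixes spatial and channel information without any multiplies in the spatial step, which is why the combination is attractive on FPGAs.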

    Extensive analysis of D7S486 in primary gastric cancer supports TESTIN as a candidate tumor suppressor gene

    Background: A high frequency of loss of heterozygosity (LOH) has been found at D7S486 in primary gastric cancer (GC). We previously identified a high-frequency LOH region on 7q31 in primary GC from China, with D7S486 as the most frequent LOH locus. This study aimed to determine which genes in this region are affected by LOH and serve as tumor suppressor genes (TSGs). A high-throughput single nucleotide polymorphism (SNP) microarray fabricated in-house was used to analyze the LOH status around D7S486 on 7q31 in 75 patients with primary GC. Western blot, immunohistochemistry, and RT-PCR were used to assess TESTIN (TES) protein and mRNA expression in 50 and 140 primary GC samples, respectively. An MTS assay was used to investigate the effect of TES overexpression on the proliferation of GC cell lines. Mutation and methylation analyses were performed to explore possible mechanisms of TES inactivation in GC.
    Results: LOH analysis identified five candidate genes (ST7, FOXP2, MDFIC, TES, and CAV1) with LOH frequencies higher than 30%. However, only TES showed the potential to be a TSG associated with GC. Among 140 pairs of GC samples, decreased TES mRNA levels were found in 96 (68.6%) tumor tissues compared with matched non-tumor tissues (p < 0.001). Reduced TES protein levels were detected by Western blot in 36 (72.0%) of the 50 tumor tissues (p = 0.001), and immunohistochemical staining agreed with the RT-PCR and Western blot results. Downregulation of TES correlated with tumor differentiation (p = 0.035) and prognosis (p = 0.035, log-rank test), and TES overexpression inhibited the growth of three GC cell lines. Hypermethylation of the TES promoter was a frequent event in primary GC and GC cell lines, whereas no specific mutation was observed in the coding region of the TES gene.
    Conclusions: Collectively, these results support the role of TES as a TSG in gastric carcinogenesis, with TES inactivated primarily by LOH and CpG island methylation.

    Full Stack Optimization of Transformer Inference: a Survey

    Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications, a trend that has been consistent over the several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, and this has made their deployment in latency-sensitive applications challenging. As such, there has been an increased focus on making Transformer models more efficient, with methods that range from changing the architecture design all the way to developing dedicated domain-specific accelerators. In this work, we survey different approaches for efficient Transformer inference, including: (i) analysis and profiling of the bottlenecks in existing Transformer architectures and their similarities and differences with previous convolutional models; (ii) implications of Transformer architecture on hardware, including the impact of non-linear operations such as Layer Normalization, Softmax, and GELU, as well as linear operations, on hardware design; (iii) approaches for optimizing a fixed Transformer architecture; (iv) challenges in finding the right mapping and scheduling of operations for Transformer models; and (v) approaches for optimizing Transformer models by adapting the architecture using neural architecture search. Finally, we perform a case study by applying the surveyed optimizations on Gemmini, the open-source, full-stack DNN accelerator generator, and we show how each of these approaches can yield improvements compared to previous benchmark results on Gemmini. Among other things, we find that a full-stack co-design approach with the aforementioned methods can result in up to 88.7x speedup with minimal performance degradation for Transformer inference.
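    The non-linear operations the survey singles out (Layer Normalization, Softmax, GELU) take only a few lines of NumPy to state, which makes it easy to see why they complicate hardware design: unlike the matrix multiplies that dominate Transformer FLOPs, each one needs exponentials, divisions, or square roots. A reference sketch follows; the tanh-based GELU is one common approximation and is an assumption here, not necessarily the variant used in the case study.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax with the usual max-subtraction for numerical stability.
    Requires exp, a reduction, and a division per element."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Layer Normalization over the last axis. Requires two reductions
    (mean, variance), a square root, and a division."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def gelu(x):
    """Tanh approximation of GELU; the transcendental tanh is the
    hardware-unfriendly part."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))
```

    Each of these is memory-bound and element-wise or reduction-heavy rather than MAC-heavy, which is why accelerator work often approximates them with lookup tables or low-degree polynomials.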